ABSTRACT


Students’ questions categorization is a challenging task as 
the available corpora are often limited in size (particularly 
with languages other than English) and require a costly pre- 
liminary manual annotation to train the classifiers. Ensem- 
ble learning can help improve machine learning results by 
combining several models, and is particularly effective at
leveraging the strengths of very different classifiers. In this
paper, we investigate how combining a rule-based annota- 
tor (based on keywords identified by an expert) with var- 
ious machine learning-based approaches and TF-IDF can 
improve the automated identification of questions asked by 
1st year medicine students on an online platform, according 
to a coding scheme using 4 dimensions. First we evaluated 
the performance of several models, calculating the kappa 
between the prediction and the manually labelled dataset, 
according to each dimension. Then, using a stacking ap- 
proach, we tried different combinations of them to design a 
predictive model with a higher performance. The results re- 
veal that the new ensemble models can help to increase the 
performance for all dimensions of the dataset, in particu- 
lar those for which the expert rule-based system showed the 
lowest performance. These results are promising as they in- 
dicate that some easy-to-train models can complement more 
manual approaches, even with a small training set of a few
hundred annotated questions.


Keywords 


Student’s question, ensemble learning, stacking, coding scheme,
hybrid method, question categorization


1. INTRODUCTION 


Categorizing students’ questions with corpora of limited size
remains a challenging task. The available classification
methods require a costly preliminary manual annotation to 
obtain labeled data, and it is tempting to try training many 
different classifiers in the hope that combining their predic- 
tions would give good performance. One of the most active 
areas in machine learning has been the study of methods to
build good ensembles of classifiers [3]. The premise that ensembles
are often much more accurate than the individual
classifiers makes them attractive. Ensemble learning
helps improve machine learning by combining several models
to obtain an overall classifier whose prediction accuracy
outperforms every single one of them [7, 9]. Among the existing
ensemble approaches, stacking [17] is often used. It
consists in training a combining classifier (sometimes called
a meta-classifier), in addition to the set of individual classifiers,
which takes as input the output of the other classifiers.


In this paper, we used a pre-existing coding scheme to annotate
questions asked by 1st year medicine students
on an online platform, and investigated different approaches 
to improve the automated identification of their questions. 
We used the stacking approach by combining heterogeneous 
classifiers such as a rule-based annotator with various ma- 
chine learning-based approaches and TF-IDF. Our goal was 
to answer two research questions: (RQ1) Can combining
different approaches improve the performance of the predic- 
tive model? (RQ2) What is the best combination of families 
of classifiers, and in particular, can a hybrid system (mixing 
expert rules and machine learning) be relevant? 


2. STATE OF THE ART 


Annotating a corpus automatically requires using a coding 
scheme or a taxonomy of sentences, messages or speech acts. 
Many taxonomies have been used to characterize the types of
questions that students ask. Graesser and Person [4] devel- 
oped a taxonomy of questions asked during tutoring sessions 
according to the level of thought. Another well-known taxonomy
widely used in education, Bloom’s taxonomy [2]
and its revisions [1], was originally created in order to help
teachers in formulating questions and therefore tends to be
more appropriate for teachers’ questions (e.g. assessment)
than for students’ ones. The questions coding scheme used
here was defined based on the corpus at hand to match them 
more accurately, even if some overlap exists with categories 
of existing taxonomies. 


Ensemble learning has been investigated in many studies 
[11, 10] and consists in weighting several individual classi- 
fiers, and combining them in order to obtain a classifier that 
outperforms every single one of them considered separately. 
First, different classifiers are generated by training models 
on different features from the training set. The generated 
classifiers are then typically combined by some form of majority
or weighted voting. In this work, we do not restrict
how the individual classifiers are trained, and we combine
different kinds of models, not only probabilistic ones, whereas
some of these techniques require probabilistic outputs.


Table 1: Coding scheme

ID    | Dim1  | Question type             | Description
1     | Ree   | Re-explain / redefine     | Ask for an explanation already done in the course material
2     | Dee   | Deepen a concept          | Broaden a knowledge, clarify an ambiguity or request for a better understanding
3     | Ver   | Validation / verification | Verify or validate a formu[...]

ID    | Dim2  | Explanation modality / Quest. subject | Description
1     | Exa   | Example                   | Example application (course/exercise)
2     | Sch   | [...]                     | [...] explanation about it
3     | Cor   | Correction                | Correction of an exercise in course/exam

ID    | Dim3  | Explanation               | Description
[...] | Def   | [...]                     | Define a concept or term
2     | Man   | [...]                     | [...]
[...] | [...] | Reason                    | Ask for the reason


The stacking framework introduced by [17] consists in combining
multiple classifier models created by using different
learning algorithms on a single dataset. Several variations
of this idea have been attempted. Ting and Witten stacked
base-level classifiers whose predictions are probability distributions
over the set of class values, rather than single class
values. The meta-level attributes are thus the probabilities
of each of the class values returned by each of the base-level
classifiers. Multi-response linear regression, an adaptation
of linear regression, is recommended for meta-level learning
to predict binary variables [15]. Merz [8] proposed a stacking
method called SCANN that uses correspondence analysis
to detect correlations between the predictions of base-level
classifiers. Todorovski and Dzeroski [16] introduced a new
meta-level learning method for combining classifiers with
stacking: meta decision trees have base-level classifiers in the
leaves, instead of class-value predictions. Researchers in [14]
presented a novel Bayesian model that relies on combining
different models in order to improve the classifier accuracy.


In this paper, we investigated how combining heterogeneous 
classifiers (derived by different learning algorithms, using 
different model representations) can help to improve the 
automated identification of questions using a stacking ap- 
proach. 


3. CONTEXT 


The dataset considered in this paper is made of questions 
asked in 2012 by 1st year medicine/pharmacy students from 
a major public French university. The Faculty of Medicine 
and Pharmacy has a specific hybrid training system (reading 
of the material for the class is done at home, and classroom 
time is dedicated mostly to Q&A) for their 1st year stu- 
dents. The 1st year is divided into two semesters, each of 
them ending with a competitive exam (in January and May) 
on the content studied during the period: only a predefined 
number of students is allowed to continue in the second year. 
Between the reading session at home and the classroom ses- 
sion, the students can connect to an online platform to either 
ask a question, or see questions asked by other students and 
vote for them if they also want an answer to that question. 
They cannot however propose an answer to those questions. 
Then, the questions asked online are sent to teachers to help 
them prepare their Q&A session. The dataset contains 6457 
questions overall asked by 429 students. 


4. QUESTION CODING SCHEME 


We chose to consider the nature of questions as defined in
the coding scheme introduced in [5], which was developed through
a process involving multiple human coders and several refinement phases.
This coding scheme (summarized in Table 1) consists in tagging
each question according to 4 independent dimensions: a
main one (dimension 1), which is mandatory, and 3 optional
ones (dimensions 2 to 4 - cf. Table 1 for the corresponding
definition of each dimension). For instance, a question could
be a request to re-explain the way something works by providing
another example (tagged as Ree (1) on dimension 1,
Exa (1) on dimension 2, Man (2) on dimension 3, and nothing
(0) on dimension 4, i.e. represented as the dimension
vector [1,1,2,0]). It could also be a request to verify (Ver
= 3) if a schema (Sch = 2) needs to be known for the final
exam (Exp = 3) (which would be tagged as [3, 2, 0, 3]). We
considered here a corpus of 923 questions manually annotated
according to the 4 dimensions of the coding scheme
for training and testing the automated annotators. A subsample
of 723 questions was used for training the classifiers,
and 200 questions were used to test their performance.


Table 1 (continued): Coding scheme

ID    | Dim3  | Explanation             | Description
[...] | [...] | Link between concepts   | Verify a link between two [...]

ID    | Dim4  | Verification            | Description
[...] | [...] | Mistake / contradiction | Detect mistake/contradiction in course or explanation of teacher
[...] | [...] | Knowledge in [...]      | Verify knowledge [...]
3     | Exp   | Expected knowledge      | Verify expected information in exam or quiz (assessment)
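The dimension vectors used in the two examples above can be reproduced with a minimal Python sketch. It only encodes the value codes that are explicitly stated in the text or in Table 1; the names `CODES` and `to_vector` are illustrative and not part of the original annotator.

```python
# Value codes per dimension, limited to those stated in the text or Table 1.
# 0 always means "no value" on that dimension.
CODES = {
    1: {"Ree": 1, "Dee": 2, "Ver": 3},
    2: {"Exa": 1, "Sch": 2},
    3: {"Man": 2},
    4: {"Exp": 3},
}


def to_vector(tags):
    """Turn a {dimension: code name} mapping into a 4-element vector."""
    if 1 not in tags:
        raise ValueError("dimension 1 is mandatory")
    return [CODES[d][tags[d]] if d in tags else 0 for d in (1, 2, 3, 4)]


print(to_vector({1: "Ree", 2: "Exa", 3: "Man"}))  # [1, 1, 2, 0]
print(to_vector({1: "Ver", 2: "Sch", 4: "Exp"}))  # [3, 2, 0, 3]
```

Storing one integer per dimension keeps each dimension as a separate classification target, which matches the way the classifiers are trained per dimension in the following sections.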


5. AUTOMATED ANNOTATORS 


5.1 Approach 1: expert rule-based annotator 
We first used a previously developed custom annotator that
relies on manually identified keywords, each associated with
a weight [6]. To design it, the human annotator identified,
from a separate dataset of questions in the corpus, the keywords
that were indicative of each dimension value (e.g. in
Dimension 1, for the dimension value “Re-explain”, some of
the keywords identified were “re-explain”, “restate”, “redefine”,
“retry”, “repeat”, “revise”, “get back on”, etc.). For
each question and each dimension, the question was tagged
with the value that had the highest number of associated
keywords (e.g. for dimension 1, a question with two keywords
associated to the value “re-explain” and one keyword associated
to the value “validation” would be tagged as a “re-explain” question).


Table 2: Kappa between the standalone expert rule-based
annotator and the reference manual annotation


The automatic annotator uses a set of weights associated
with each keyword of each dimension (e.g. “explain”: 7,
“what/how”: 3), defined using the set of 723 questions.
Those weights were determined in two steps. First, the human
annotators empirically assigned a weight between 1
and 9 to each keyword, depending on whether they thought
it was very marginally (1), significantly (5) or very significantly
(9) associated with a given dimension. Then, in a
second step, the automatic annotator was run on the 723
manually annotated questions, and weights were manually
adjusted (adding or removing 1) on some keywords for questions
for which the manual and automatic annotations
differed, iterating until agreement was obtained on almost
all segments from the 723 questions. Finally, each question,
identified by the values associated with each dimension,
is represented as a dimension vector.


The Kappa values per dimension for the annotations coming 
from human expert and the automatic annotator are given 
in Table 2. 
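The keyword-and-weight tagging described in this section can be illustrated with a minimal Python sketch. It is not the annotator from [6]: the scoring rule (summing the weights of the matched keywords and keeping the best value) is an assumption, the weights are invented, and the "verify" / "is it correct" keywords are hypothetical. Only "re-explain", "restate", "redefine" and "repeat" come from the examples given in the text.

```python
# Hypothetical lexicon for dimension 1: value code -> {keyword: weight}.
# The weights are invented; the Ver keywords are made up for illustration.
DIM1_LEXICON = {
    1: {"re-explain": 9, "restate": 7, "redefine": 7, "repeat": 5},  # Ree
    3: {"verify": 7, "is it correct": 5},                            # Ver
}


def annotate(question, lexicon, default=0):
    """Return the value whose matched keywords have the largest total weight.

    Falls back to `default` (0 = no value) when no keyword matches.
    """
    text = question.lower()
    scores = {
        value: sum(weight for kw, weight in keywords.items() if kw in text)
        for value, keywords in lexicon.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default
```

For instance, `annotate("Could you re-explain and restate this?", DIM1_LEXICON)` returns 1 (Ree). A real annotator would need a fallback other than 0 on dimension 1, which is mandatory, and a proper tokenizer instead of substring matching.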


5.2 Statistical approaches 


5.2.1 Data coding: from questions to words vectors 
First, we used the French version of WordNet [12], a lexi- 
cal database linking semantic concepts to each other in an 
ontology according to a variety of semantic relations (e.g. 
synonyms and hyperonyms). The aim was to transform dif- 
ferent synonyms into the same expression in the questions 
(e.g. for dimension value “Reason” in Dim3, the synonym 
words “cause”, “reason” and “motif” were replaced in the text 
by “why”) to reduce the lexical diversity and consolidate 
a particular expression for the following treatment. Then,
classical preprocessing steps were applied to the corpus of
923 questions: punctuation removal, stemming, tokenization,
and stopword filtering. After extracting the unigrams
and bigrams for all questions in the corpus, the weights for
the words were computed using two different methods: (1)
TF-IDF (described in the next section), (2) binary
occurrence (’1’ if the word is in the question, ’0’ otherwise).
Each of the 723 questions was represented by a word vector 
according to (1) or (2). We finally reduced the number of 
extracted keywords to keep the most important and signifi- 
cant ones using a feature selection technique (removing less 
frequent and correlated unigrams and bigrams). 
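The preprocessing and the binary coding (method 2) can be sketched as follows. This is a simplified illustration under several assumptions: the stopword list is a tiny hypothetical English one, stemming, the WordNet synonym replacement and feature selection are omitted, and the regular expression only handles unaccented characters, so it would need adapting for the French corpus.

```python
import re

STOPWORDS = {"the", "a", "of", "is", "to"}  # hypothetical, tiny English list


def tokens(question):
    """Lowercase, strip punctuation, tokenize and drop stopwords (no stemming)."""
    words = re.findall(r"[a-z0-9]+", question.lower())
    return [w for w in words if w not in STOPWORDS]


def ngrams(question):
    """Unigrams plus bigrams of adjacent remaining tokens."""
    toks = tokens(question)
    return toks + [f"{a} {b}" for a, b in zip(toks, toks[1:])]


def binary_vectors(questions):
    """One 0/1 vector per question over the sorted n-gram vocabulary."""
    grams = [set(ngrams(q)) for q in questions]
    vocab = sorted(set().union(*grams))
    return vocab, [[1 if g in s else 0 for g in vocab] for s in grams]
```

Calling `binary_vectors(["Why is the cell dividing?", "Why the membrane?"])` yields a 7-term vocabulary (4 unigrams and 3 bigrams) and one 0/1 vector per question.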


5.2.2 Approach 2: TF-IDF

We used TF-IDF [13] to compute term weights. The goal of 
TF-IDF is to estimate how the words in a given document 
are representative of that document when compared to a 
larger set of documents. It combines two complementary 
metrics: the term frequency (TF), and the inverse document
frequency (IDF). TF gives a higher weight to commonly
occurring terms and a lower weight to rare terms.
The drawback is that words that are common in a
given document but also common in all documents could
end up with a weight that over-represents their real importance.
IDF fixes this issue by adjusting the weight according
to how common the term is across the whole corpus. Equation 1
describes the method to compute individual TF-IDF weight values
for each term (word). We computed TF-IDF in two different
ways on the corpus of 723 questions.


$$W_{ik} = TF_{ik} \cdot \log\left(\frac{N}{n_k}\right) \qquad (1)$$


Where:

$W_{ik}$ = TF-IDF weight for term $k$ in document $Q_i$
$TF_{ik}$ = frequency of term $k$ in document $Q_i$
$IDF_k$ = inverse document frequency of term $k$ = $\log\left(\frac{N}{n_k}\right)$
$N$ = total number of documents in the questions corpus
$n_k$ = number of documents in the corpus that contain the term $k$
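As a worked illustration of Equation 1, consider a hypothetical corpus of $N = 3$ documents and assume a natural logarithm (the base is not specified in the text). A term that occurs twice in document $Q_i$ and appears in only one document gets a high weight, whereas a term that appears in every document gets a weight of zero regardless of its frequency:

```latex
% Term k occurs twice in Q_i and appears in 1 of the 3 documents
W_{ik} = TF_{ik} \cdot \log\left(\frac{N}{n_k}\right)
       = 2 \cdot \ln\left(\frac{3}{1}\right) \approx 2.197

% Term k' occurs twice in Q_i but appears in all 3 documents
W_{ik'} = 2 \cdot \ln\left(\frac{3}{3}\right) = 0
```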


The first version consists in calculating four separate TF-IDF
models, one on each of the four dimensions, to extract the words
that differentiate each category on each dimension. For a given
dimension, all the questions manually annotated in each category
(e.g. “Re-explain”) were considered as one document (e.g.
on dimension 1, document 1 is the union of questions annotated
as “Ree”). Each document (set of questions) is converted
into a corresponding word-weight vector, where each
word-weight represents the TF-IDF measure for the word in
the document. A TF-IDF weight ($W_{ik}$) was attributed to each
term k in document i (i indexes the documents of that
dimension, e.g. i varies from 1 to 3 for dimension 1). In
order to classify new questions, we used the TF-IDF weights
calculated on each dimension value from the sample of 723
questions: the weights calculated on the training sample were
attributed to the corresponding words in the test sample
of 200 questions. Then, we chose the simplest ranking
function, which consists in summing the TF-IDF weights for
each question on each dimension value. Therefore, for each
question and each dimension, we tag the question with the
value that has the maximum summed weight. Finally, we
calculated the Kappa values between the values found by this
model [TF-IDF+MAX] for that dimension, and the corresponding
values found by the manual annotation (cf. first column of
Table 3). Two versions were tested: one where the questions
were preprocessed using WordNet (cf. previous section) and one
where they were not. The results obtained were similar in terms
of performance, so we decided to keep the version including the
preprocessing with WordNet, as it intuitively should generalize
better to variations of existing questions.
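The [TF-IDF+MAX] procedure described above can be sketched in Python for a single dimension. The function names are illustrative, and two details are assumptions rather than facts from the text: TF is taken as the raw count of a term in the category document, and the logarithm is natural.

```python
import math
from collections import Counter


def train_tfidf(labelled):
    """labelled: list of (tokens, category) pairs for one dimension.

    All questions of a category are merged into one document, then
    W_ik = TF_ik * log(N / n_k) is computed for every term of every document.
    """
    docs = {}
    for toks, cat in labelled:
        docs.setdefault(cat, Counter()).update(toks)
    n_docs = len(docs)
    doc_freq = Counter(term for counts in docs.values() for term in counts)
    return {
        cat: {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in counts.items()}
        for cat, counts in docs.items()
    }


def classify(toks, weights):
    """Sum the training weights of the question's terms; keep the best category."""
    scores = {cat: sum(w.get(t, 0.0) for t in toks) for cat, w in weights.items()}
    return max(scores, key=scores.get)
```

With a toy training set in which "explain" only occurs in questions tagged Ree, `classify(["explain", "mitosis"], weights)` selects Ree, because terms shared by several categories (here "mitosis") receive a lower IDF and therefore contribute less to the sum.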


In the second version, TF-IDF was calculated on the corpus
of 723 questions without distinguishing the different dimensions.
The questions were not grouped by dimension value,
but instead, each question in the corpus was considered as
a document (i.e. 723 documents overall). The document
is then converted into a corresponding word-weight vector,
where each word-weight represents the TF-IDF measure for
the word in the question. Finally, we used the word vectors
as the input for machine learning techniques to predict the
value associated to the question in that dimension (described
in section 5.2.3).


Table 3: Kappa between automatic annotation obtained
by standalone TF-IDF + different ML methods
and the reference manual annotation

        TF-IDF +
Dim.  | Max  | GLM  | GBT  | NB   | K-NN | DT    | RI
Dim1  | 0.66 | 0.60 | 0.71 | 0.47 | 0.62 | 0.46  | [...]
Dim2  | 0.39 | 0.73 | 0.69 | 0.12 | 0.56 | 0.49  | 0.36
Dim3  | 0.66 | 0.59 | 0.60 | 0.43 | 0.58 | 0.37  | 0.52
Dim4  | 0.58 | 0.71 | 0.63 | 0.37 | 0.60 | [...] | 0


5.2.3 Approach 3: ML-based annotator 

We tried 6 machine learning (ML) classification techniques 
(Generalized Linear Model [GLM], Gradient Boosted Trees 
[GBT], Decision Tree [DT], K Nearest Neighbors [K-NN], 
Rule Induction [RI] and Naive Bayes [NB]) for each dimension
separately. The appropriate hyper-parameters (such as
k for K-NN) were chosen in each case to obtain the highest
kappa value and may differ from one table to another. For
each classifier, the input was the word vectors and the la- 
bel to predict was the value associated to the question in 
that dimension. We considered the corpus of 923 questions 
as labeled data. Then, we trained the classifiers on the 723 
questions and evaluated their performance on an indepen- 
dent sample of 200 questions, to ensure a good estimation 
of the performance on unseen data. Finally, we calculated 
the Kappa values between the values found by the classi- 
fication model for that dimension, and the corresponding 
values found by the manual annotation. We considered two 
versions for comparison here as well: the first one using the 
corpus processed using WordNet, and the second one with- 
out the processing with WordNet. 
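The agreement measure used throughout the evaluation can be sketched as follows, assuming that the reported kappa is Cohen's kappa between the predicted and the manual labels. The function name `cohen_kappa` is illustrative.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa between two equally long sequences of labels."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    # Observed agreement: share of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from the marginal label distributions.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both sequences use one and the same label everywhere.
        return 1.0
    return (observed - expected) / (1 - expected)
```

For the manual labels `[1, 1, 2, 2, 3, 3]` and the predictions `[1, 1, 2, 3, 3, 3]`, the observed agreement is 5/6, the chance agreement is 1/3, and the kappa is 0.75. A classifier that always predicts the same value obtains a kappa of 0 even when that value is frequent, which is why unbalanced dimensions can produce kappas of 0.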


5.3 Results 


The kappa values found with the three automated anno- 
tators taken individually (expert rule-based, TF-IDF and 
ML) are provided in Tables 2, 3 and 4 respectively for each 
dimension. We note that the expert rule-based annotator
clearly outperforms both the ML-based annotators and TF-IDF
only on dimension 1, whereas they have almost similar
performance on dimension 3. TF-IDF with the GLM classifier
gives the best performance on dimension 4. Furthermore,
the ML-based annotation without WordNet performs better
than the classifiers using WordNet for all dimensions and
particularly on dimension 2.


6. ENSEMBLE HYBRID APPROACH 


Our next step consists in building a predictive model with
a higher performance to improve the automated identification
of questions according to the coding scheme provided
in Table 1. Using the aforementioned stacking approach, we
tried different combinations of models regardless of which
classifier is the best one. Moreover, stacking does not require any
of the classifiers to be probabilistic; they can even be human
experts. Our goal was not only to obtain the best classifier
performance, but also to do so using a fairly small training
set of annotated questions and see if a good performance
could be obtained nonetheless.


Table 4: Kappa between automatic annotation obtained
by standalone different ML methods and the
reference manual annotation

Processing with WordNet
Dim.  | GLM  | GBT  | NB    | K-NN | DT   | RI
Dim1  | 0.67 | 0.70 | [...] | 0.60 | 0.78 | 0.00
Dim2  | [...]
Dim3  | 0.68 | 0.64 | 0.37  | 0.61 | 0.59 | 0.60
Dim4  | 0.63 | 0.66 | 0.37  | 0.60 | 0.48 | 0.68

Processing without WordNet
Dim.  | GLM  | GBT  | NB    | K-NN | DT   | RI
Dim1  | 0.73 | 0.69 | 0.33  | 0.56 | 0.74 | 0.06
Dim2  | [...]
Dim3  | 0.70 | 0.65 | 0.35  | 0.60 | 0.57 | 0.62
Dim4  | [...]


6.1 Method for stacking 


In the first phase, a set of 20 base-level models was
created (1 expert rule-based annotator, 7 TF-IDF annotators
and 12 ML-based annotators). In the second phase, we
train a meta-level classifier that combines the outputs
of the base-level models. In other words, we have 20
predictions for each dimension for each of the 200 question
segments in the testing set, as well as the manual annotations
for these 200 segments, which provide a ground truth,
and we want to train a classification model using some subsets
of these 20 features. We trained the meta-level classifier
using the same aforementioned 6 classification techniques
(GBT, GLM, NB, K-NN, DT, and RI) for each dimension
separately, using a 10-fold cross validation to ensure a good
estimation of the performance (i.e. training the models on
180 segments and testing on 20). Finally, for each model we
calculated the Kappa values between the values found by
that meta-model for that dimension, and the corresponding
values found by the manual annotation. Regarding the set of
features, we wanted to consider combinations that mixed
different sets of approaches, and we therefore considered
the six meta-learning combinations described below. For
each of them, the training was performed four times (once
for each of the four dimensions - cf. Figure 1).


(1) Stacked TF-IDF models: We combined the outputs
of the methods using each individual TF-IDF classifier to
compute keyword weights (i.e. 7 features for each classifier,
cf. Table 3).


(2) Stacking TF-IDF with expert rule-based annotation:
We combined the outputs of the TF-IDF models
with the output of the expert rule-based annotator (i.e. 8
features for each classifier, cf. Tables 3 and 2).


(3) Stacked ML techniques: We combined the outputs
of the machine learning-based annotation with the two
variants: processing using WordNet and without it (i.e. 12
features for each classifier, cf. Table 4).


(4) Stacking ML techniques with expert rule-based
annotation: We combined the outputs of the machine
learning-based annotation (with and without WordNet) with
the output of the expert rule-based annotation (i.e. 13
features for each classifier, cf. Tables 4 and 2).


(5) Stacking ML techniques with TF-IDF: We combined
the outputs of the machine learning-based annotation
(with and without WordNet) with the output of TF-IDF
based annotation (i.e. 19 features for each classifier, cf.
Tables 4 and 3).


(6) Stacking ML, TF-IDF and expert rule-based annotation:
We combined the outputs of all the existing classifiers:
the machine learning-based annotation (with and
without WordNet) with the output of TF-IDF and expert
rule-based annotation (i.e. 20 features for each classifier, cf.
Tables 4, 3 and 2).


Figure 1: The overall stacking process (the outputs of the
expert rule-based, machine learning and TF-IDF classifiers
are stacked, and a kappa is computed for each of dimensions 1 to 4)
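The meta-level training described in this section can be sketched as follows for one dimension and one feature set. This is an illustration under assumptions, not the original setup: the meta-classifier is a small categorical Naive Bayes with add-one smoothing written from scratch, folds are built by interleaving indices, and all function names are hypothetical. Each row holds the predictions of the selected base-level models for one question (e.g. 8 values for combination 2).

```python
import math
from collections import Counter, defaultdict


def train_nb(rows, labels):
    """Categorical Naive Bayes over base-level predictions."""
    prior = Counter(labels)
    cond = defaultdict(Counter)   # (feature index, label) -> value counts
    values = defaultdict(set)     # feature index -> values seen in training
    for row, y in zip(rows, labels):
        for j, v in enumerate(row):
            cond[(j, y)][v] += 1
            values[j].add(v)
    return prior, cond, values


def predict_nb(model, row):
    """Return the label with the highest smoothed log-posterior."""
    prior, cond, values = model
    total = sum(prior.values())

    def log_post(y):
        lp = math.log(prior[y] / total)
        for j, v in enumerate(row):
            # Add-one smoothing, with one extra slot for unseen values.
            lp += math.log((cond[(j, y)][v] + 1) / (prior[y] + len(values[j]) + 1))
        return lp

    return max(prior, key=log_post)


def cross_val_predict(rows, labels, k=10):
    """Predict every item with a model trained on the other k-1 folds."""
    preds = [None] * len(rows)
    for fold in range(k):
        test = range(fold, len(rows), k)
        held_out = set(test)
        train = [i for i in range(len(rows)) if i not in held_out]
        model = train_nb([rows[i] for i in train], [labels[i] for i in train])
        for i in test:
            preds[i] = predict_nb(model, rows[i])
    return preds
```

With 200 questions and `k=10`, each model is trained on 180 questions and tested on 20, as in the text. The cross-validated predictions can then be compared with the manual annotation to obtain one kappa per dimension and per feature set.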


6.2 Results and discussion 

The kappa values found with the 6 classification techniques
for each dimension are provided in Table 5. Each stacking
model was trained individually on each dimension and the
highest value obtained for each dimension among the 6 classifiers
is tagged in bold, for each set of features considered.
For instance, on the first row, we see that when combining
the 7 TF-IDF classifiers that predict dimension 1, the best
stacking result is obtained with a decision tree (0.75), which
outperforms the best individual TF-IDF classifier (0.71 with
GBT, cf. Table 3). We can notice that Naive Bayes is often
the best ensemble classifier among the 6 tested, giving
better performance on a small dataset. The best overall
performance across the 6 sets of features is marked
with a star (*): for dimensions 1 and 4, it is Naive Bayes
combining the ML and the expert rule-based classifiers, for
dimension 2 it is Naive Bayes combining TF-IDF and the
expert rule-based classifiers, and for dimension 3 it is GBT
combining also TF-IDF and the expert rule-based classifiers.


When considering the combinations involving TF-IDF, we
see that the combination of several TF-IDF models outperforms the
base-level TF-IDF on dimensions 1 and 3. The kappa values
are overall lower on dimensions 2 and 4, which is probably
due to the unbalanced training data in these dimensions
(it also explains why a classifier sometimes obtains a
kappa of 0 on these dimensions in the various tables). Moreover,
the various TF-IDF classifiers combined with the expert
rule-based annotator outperform both the base-level TF-IDF
and the expert rule-based annotator, as well as the combination
of several TF-IDF models. Similar results were found for several
TF-IDF models combined with machine learning, with a slightly
better performance than individual classifiers. Overall, if
one had to choose only one set of features, the best option is
a hybrid ensemble (TF-IDF with expert rule-based annotator),
which outperforms the other model combinations on average,
with an average kappa of 0.77 (from the classifiers giving the
best performance on each dimension, i.e. NB on dimensions
1 and 2, GBT on dimension 3 and K-NN on dimension 4).


When considering the combinations involving ML-based classifiers,
the ML-based annotator combined with the expert rule-based
annotator slightly outperforms the base-level machine learning
on dimensions 1, 3 and 4 compared to the other ML combinations.
Similarly to TF-IDF, the hybrid ensemble (ML
with expert rule-based annotator) gives an average kappa of
0.77 instead of 0.74 for the base-level ML.


The combination of the three types of approaches obtains a
performance similar to or lower than that of the two previously
mentioned hybrid ensembles.


7. CONCLUSION 


In this paper, we have shown that even with a small training
set (fewer than 1000 questions), it can be useful to add
ML-based approaches to complement a manually crafted annotator,
using a stacking approach to combine classifiers with
each other. Using a hybrid ensemble of machine learning-based
(or TF-IDF-based) annotators with a previously existing
annotator seems to be the best approach, leveraging
the benefits of each approach. Combining TF-IDF and ML
approaches, however, does not seem as relevant. In our case,
the hybrid ensemble models helped in increasing the performance
for almost all dimensions, thus answering positively
our two initial research questions. It is worth noting though
that the use of WordNet to reduce the vocabulary did not
help in increasing the classifiers’ performance in our case.


Table 5: Kappa values between the ensemble models
and the reference manual annotation

(1) Stacked TF-IDF models
Dim.  | GLM   | GBT   | NB    | K-NN  | DT    | RI
Dim1  | [...] | [...] | [...] | [...] | 0.75  | [...]
Dim2  | 0     | 0.35  | 0.67  | [...] | 0.51  | 0
Dim3  | 0.62  | 0.70  | 0.66  | 0.67  | 0.68  | 0.66
Dim4  | 0.55  | 0.67  | 0.68  | 0.69  | 0.69  | 0.67

(2) TF-IDF with expert rule-based annotation
Dim.  | GLM   | GBT   | NB    | K-NN  | DT    | RI
Dim1  | [...]
Dim2  | 0     | 0.30  | 0.80* | 0.66  | 0.48  | 0
Dim3  | [...]
Dim4  | 0.60  | 0.66  | 0.72  | 0.73  | 0.67  | 0.65

(3) Stacked ML techniques
Dim.  | GLM   | GBT   | NB    | K-NN  | DT    | RI
Dim1  | [...]
Dim2  | 0.30  | 0.48  | 0.77  | 0.59  | [...] | 0
Dim3  | [...]
Dim4  | [...]

(4) ML techniques with expert rule-based annotation
Dim.  | GLM   | GBT   | NB    | K-NN  | DT    | RI
Dim1  | 0.77  | 0.77  | 0.80* | 0.76  | 0.70  | 0.09
Dim2  | 0.16  | 0.48  | 0.77  | 0.60  | 0.02  | 0
Dim3  | [...] | 0.76  | 0.71  | 0.73  | 0.66  | 0.67
Dim4  | 0.60  | 0.06  | 0.74* | 0.69  | 0.63  | 0.59

(5) ML techniques with TF-IDF
Dim.  | GLM   | GBT   | NB    | K-NN  | DT    | RI
Dim1  | 0.73 0.76 [...]
Dim2  | 0.30  | 0.52  | 0.78  | [...] | [...] | 0
Dim3  | 0.66  | 0.75  | 0.71  | 0.72  | 0.70  | 0.02
Dim4  | 0.60  | [...] | 0.71  | 0.71  | 0.64  | 0.61

(6) ML, TF-IDF and expert rule-based annotation
Dim.  | GLM   | GBT   | NB    | K-NN  | DT    | RI
Dim1  | [...]
Dim2  | 0     | 0.56  | [...]
Dim3  | [...]
Dim4  | 0.61  | [...] | 0.70  | 0.69  | 0.63  | 0.51


One of the limitations of this paper is that we considered only a
single coding scheme and dataset. The increase in kappas
can also sometimes be seen as modest, but this is to be put
in perspective with the fact that human coders using this
coding scheme can rarely reach a kappa above 0.75 on
such a task. Moreover, one should note that the dimensions
that were improved were the ones that were the furthest
from the human coder performance. To conclude, we believe
our results can open the perspective of easily improving the
performance of various speech act and message annotators,
which often rely only on expert rule-based annotators.